Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery. [http://machinelearningmastery.com/]
SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Forest Cover Type dataset presents a multi-class classification problem in which we try to predict one of seven possible cover types.
INTRODUCTION: This experiment tries to predict forest cover type from cartographic variables only. This study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas represent forests with minimal human-caused disturbances, so that existing forest cover types are more a result of ecological processes rather than forest management practices.
The actual forest cover type for a given observation (30 x 30-meter cell) was determined from the US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data originally obtained from the US Geological Survey (USGS) and USFS data. Data is in raw form (not scaled) and contains binary (0 or 1) columns of data for qualitative independent variables (wilderness areas and soil types).
In iteration Take1, we established the baseline accuracy for comparison with future rounds of modeling.
In iteration Take2, we examined the feature selection technique of attribute importance ranking by using the Gradient Boosting algorithm. By selecting the most important attributes, we decreased the modeling time and still maintained a similar level of accuracy when compared to the baseline model.
In iteration Take3, we will examine the feature selection technique of recursive feature elimination (RFE) by using the Random Forest algorithm. By selecting no more than 30 attributes, we hope to maintain a similar level of accuracy when compared to the baseline model.
ANALYSIS: From iteration Take1, the baseline performance of the machine learning algorithms achieved an average accuracy of 78.04%. Two algorithms (Random Forest and Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in the top overall result and achieved an accuracy metric of 85.48%. By using the optimized parameters, the Random Forest algorithm processed the testing dataset with an accuracy of 86.07%, which was even better than the predictions from the training data.
From iteration Take2, the performance of the machine learning algorithms achieved an average accuracy of 74.27%. Random Forest achieved an accuracy metric of 85.47% with the training data and processed the testing dataset with an accuracy of 85.85%, again better than the predictions from the training data. At the importance level of 99%, the attribute importance technique eliminated 22 of the 54 total attributes. The remaining 32 attributes produced a model with accuracy comparable to the baseline model, and the modeling time went from 1 hour 19 minutes down to 58 minutes, a reduction of 26.6%.
From the current iteration, the performance of the machine learning algorithms achieved an average accuracy of 73.25%. Random Forest achieved an accuracy metric of 84.24% with the training data and processed the testing dataset with an accuracy of 84.77%, again better than the predictions from the training data. The RFE technique eliminated 42 of the 54 total attributes. The remaining 12 attributes produced a model with accuracy comparable to the baseline model, and the modeling time went from 1 hour 19 minutes down to 33 minutes, a reduction of 58.2%.
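The time-reduction figures can be verified directly from the stated durations; a quick standalone arithmetic check in R:

```r
# Modeling times in minutes: baseline (Take1) vs. this iteration (Take3)
baseline_mins <- 1 * 60 + 19   # 1 hour 19 minutes
take3_mins <- 33
# Percentage reduction relative to the baseline
reduction <- (baseline_mins - take3_mins) / baseline_mins * 100
round(reduction, 1)            # 58.2
```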
CONCLUSION: For this iteration, the Random Forest algorithm achieved the best overall results using the training and testing datasets. For this dataset, Random Forest should be considered for further modeling.
Dataset Used: Covertype Data Set
Dataset ML Model: Multi-Class classification with numerical attributes
Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Covertype
One source of potential performance benchmarks: https://www.kaggle.com/c/forest-cover-type-prediction/overview
The project aims to touch on the following areas:
Any predictive modeling machine learning project generally can be broken down into about six major tasks:
startTimeScript <- proc.time()
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
library(corrplot)
## corrplot 0.84 loaded
library(DMwR)
## Loading required package: grid
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(Hmisc)
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
library(mailR)
## Registered S3 method overwritten by 'R.oo':
## method from
## throw.default R.methodsS3
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
library(stringr)
# Create one random seed number for reproducible results
seedNum <- 888
set.seed(seedNum)
email_notify <- function(msg="") {
  sender <- Sys.getenv("MAIL_SENDER")
  receiver <- Sys.getenv("MAIL_RECEIVER")
  gateway <- Sys.getenv("SMTP_GATEWAY")
  smtpuser <- Sys.getenv("SMTP_USERNAME")
  password <- Sys.getenv("SMTP_PASSWORD")
  sbj_line <- "Notification from R Multi-Class Classification Script"
  send.mail(
    from = sender,
    to = receiver,
    subject = sbj_line,
    body = msg,
    smtp = list(host.name = gateway, port = 587, user.name = smtpuser, passwd = password, ssl = TRUE),
    authenticate = TRUE,
    send = TRUE)
}
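The function reads its SMTP settings from environment variables. A minimal usage sketch with placeholder values (substitute real credentials before sending):

```r
# Hypothetical configuration -- all values below are placeholders
Sys.setenv(
  MAIL_SENDER   = "sender@example.com",
  MAIL_RECEIVER = "receiver@example.com",
  SMTP_GATEWAY  = "smtp.example.com",
  SMTP_USERNAME = "user",
  SMTP_PASSWORD = "secret"
)
# email_notify() picks these up via Sys.getenv() when called, e.g.:
# email_notify(paste("Test message", date()))
Sys.getenv("SMTP_GATEWAY")
```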
# Set up the notifyStatus flag to control progress emails (set it to TRUE to send emails!)
notifyStatus <- FALSE
if (notifyStatus) email_notify(paste("Library and Data Loading has begun!",date()))
# Slicing up the document path to get the final destination file name
dataset_path <- 'https://www.kaggle.com/c/forest-cover-type-prediction/download/train.csv'
doc_path_list <- str_split(dataset_path, "/")
dest_file <- doc_path_list[[1]][length(doc_path_list[[1]])]
if (!file.exists(dest_file)) {
  # Download the document from the website
  cat("Downloading", dataset_path, "as", dest_file, "\n")
  download.file(dataset_path, dest_file, mode = "wb")
  cat(dest_file, "downloaded!\n")
  # unzip(dest_file)
  # cat(dest_file, "unpacked!\n")
}
inputFile <- dest_file
Xy_original <- read.csv(inputFile, sep=',', header=TRUE, row.names=1)
Xy_original$Cover_Type <- as.factor(Xy_original$Cover_Type)
# Take a peek at the dataframe after the import
head(Xy_original)
## Elevation Aspect Slope Horizontal_Distance_To_Hydrology
## 1 2596 51 3 258
## 2 2590 56 2 212
## 3 2804 139 9 268
## 4 2785 155 18 242
## 5 2595 45 2 153
## 6 2579 132 6 300
## Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways
## 1 0 510
## 2 -6 390
## 3 65 3180
## 4 118 3090
## 5 -1 391
## 6 -15 67
## Hillshade_9am Hillshade_Noon Hillshade_3pm
## 1 221 232 148
## 2 220 235 151
## 3 234 238 135
## 4 238 238 122
## 5 220 234 150
## 6 230 237 140
## Horizontal_Distance_To_Fire_Points Wilderness_Area1 Wilderness_Area2
## 1 6279 1 0
## 2 6225 1 0
## 3 6121 1 0
## 4 6211 1 0
## 5 6172 1 0
## 6 6031 1 0
## Wilderness_Area3 Wilderness_Area4 Soil_Type1 Soil_Type2 Soil_Type3
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## Soil_Type4 Soil_Type5 Soil_Type6 Soil_Type7 Soil_Type8 Soil_Type9
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## Soil_Type10 Soil_Type11 Soil_Type12 Soil_Type13 Soil_Type14 Soil_Type15
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 1 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## Soil_Type16 Soil_Type17 Soil_Type18 Soil_Type19 Soil_Type20 Soil_Type21
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## Soil_Type22 Soil_Type23 Soil_Type24 Soil_Type25 Soil_Type26 Soil_Type27
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## Soil_Type28 Soil_Type29 Soil_Type30 Soil_Type31 Soil_Type32 Soil_Type33
## 1 0 1 0 0 0 0
## 2 0 1 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 1 0 0 0
## 5 0 1 0 0 0 0
## 6 0 1 0 0 0 0
## Soil_Type34 Soil_Type35 Soil_Type36 Soil_Type37 Soil_Type38 Soil_Type39
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## Soil_Type40 Cover_Type
## 1 0 5
## 2 0 5
## 3 0 2
## 4 0 2
## 5 0 5
## 6 0 2
sapply(Xy_original, class)
## Elevation Aspect
## "integer" "integer"
## Slope Horizontal_Distance_To_Hydrology
## "integer" "integer"
## Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways
## "integer" "integer"
## Hillshade_9am Hillshade_Noon
## "integer" "integer"
## Hillshade_3pm Horizontal_Distance_To_Fire_Points
## "integer" "integer"
## Wilderness_Area1 Wilderness_Area2
## "integer" "integer"
## Wilderness_Area3 Wilderness_Area4
## "integer" "integer"
## Soil_Type1 Soil_Type2
## "integer" "integer"
## Soil_Type3 Soil_Type4
## "integer" "integer"
## Soil_Type5 Soil_Type6
## "integer" "integer"
## Soil_Type7 Soil_Type8
## "integer" "integer"
## Soil_Type9 Soil_Type10
## "integer" "integer"
## Soil_Type11 Soil_Type12
## "integer" "integer"
## Soil_Type13 Soil_Type14
## "integer" "integer"
## Soil_Type15 Soil_Type16
## "integer" "integer"
## Soil_Type17 Soil_Type18
## "integer" "integer"
## Soil_Type19 Soil_Type20
## "integer" "integer"
## Soil_Type21 Soil_Type22
## "integer" "integer"
## Soil_Type23 Soil_Type24
## "integer" "integer"
## Soil_Type25 Soil_Type26
## "integer" "integer"
## Soil_Type27 Soil_Type28
## "integer" "integer"
## Soil_Type29 Soil_Type30
## "integer" "integer"
## Soil_Type31 Soil_Type32
## "integer" "integer"
## Soil_Type33 Soil_Type34
## "integer" "integer"
## Soil_Type35 Soil_Type36
## "integer" "integer"
## Soil_Type37 Soil_Type38
## "integer" "integer"
## Soil_Type39 Soil_Type40
## "integer" "integer"
## Cover_Type
## "factor"
sapply(Xy_original, function(x) sum(is.na(x)))
## Elevation Aspect
## 0 0
## Slope Horizontal_Distance_To_Hydrology
## 0 0
## Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways
## 0 0
## Hillshade_9am Hillshade_Noon
## 0 0
## Hillshade_3pm Horizontal_Distance_To_Fire_Points
## 0 0
## Wilderness_Area1 Wilderness_Area2
## 0 0
## Wilderness_Area3 Wilderness_Area4
## 0 0
## Soil_Type1 Soil_Type2
## 0 0
## Soil_Type3 Soil_Type4
## 0 0
## Soil_Type5 Soil_Type6
## 0 0
## Soil_Type7 Soil_Type8
## 0 0
## Soil_Type9 Soil_Type10
## 0 0
## Soil_Type11 Soil_Type12
## 0 0
## Soil_Type13 Soil_Type14
## 0 0
## Soil_Type15 Soil_Type16
## 0 0
## Soil_Type17 Soil_Type18
## 0 0
## Soil_Type19 Soil_Type20
## 0 0
## Soil_Type21 Soil_Type22
## 0 0
## Soil_Type23 Soil_Type24
## 0 0
## Soil_Type25 Soil_Type26
## 0 0
## Soil_Type27 Soil_Type28
## 0 0
## Soil_Type29 Soil_Type30
## 0 0
## Soil_Type31 Soil_Type32
## 0 0
## Soil_Type33 Soil_Type34
## 0 0
## Soil_Type35 Soil_Type36
## 0 0
## Soil_Type37 Soil_Type38
## 0 0
## Soil_Type39 Soil_Type40
## 0 0
## Cover_Type
## 0
# Not applicable for this iteration of the project.
# Take a peek at the dataframe after the cleaning
head(Xy_original)
## Elevation Aspect Slope Horizontal_Distance_To_Hydrology
## 1 2596 51 3 258
## 2 2590 56 2 212
## 3 2804 139 9 268
## 4 2785 155 18 242
## 5 2595 45 2 153
## 6 2579 132 6 300
## Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways
## 1 0 510
## 2 -6 390
## 3 65 3180
## 4 118 3090
## 5 -1 391
## 6 -15 67
## Hillshade_9am Hillshade_Noon Hillshade_3pm
## 1 221 232 148
## 2 220 235 151
## 3 234 238 135
## 4 238 238 122
## 5 220 234 150
## 6 230 237 140
## Horizontal_Distance_To_Fire_Points Wilderness_Area1 Wilderness_Area2
## 1 6279 1 0
## 2 6225 1 0
## 3 6121 1 0
## 4 6211 1 0
## 5 6172 1 0
## 6 6031 1 0
## Wilderness_Area3 Wilderness_Area4 Soil_Type1 Soil_Type2 Soil_Type3
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## Soil_Type4 Soil_Type5 Soil_Type6 Soil_Type7 Soil_Type8 Soil_Type9
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## Soil_Type10 Soil_Type11 Soil_Type12 Soil_Type13 Soil_Type14 Soil_Type15
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 1 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## Soil_Type16 Soil_Type17 Soil_Type18 Soil_Type19 Soil_Type20 Soil_Type21
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## Soil_Type22 Soil_Type23 Soil_Type24 Soil_Type25 Soil_Type26 Soil_Type27
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## Soil_Type28 Soil_Type29 Soil_Type30 Soil_Type31 Soil_Type32 Soil_Type33
## 1 0 1 0 0 0 0
## 2 0 1 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 1 0 0 0
## 5 0 1 0 0 0 0
## 6 0 1 0 0 0 0
## Soil_Type34 Soil_Type35 Soil_Type36 Soil_Type37 Soil_Type38 Soil_Type39
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## Soil_Type40 Cover_Type
## 1 0 5
## 2 0 5
## 3 0 2
## 4 0 2
## 5 0 5
## 6 0 2
sapply(Xy_original, class)
## Elevation Aspect
## "integer" "integer"
## Slope Horizontal_Distance_To_Hydrology
## "integer" "integer"
## Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways
## "integer" "integer"
## Hillshade_9am Hillshade_Noon
## "integer" "integer"
## Hillshade_3pm Horizontal_Distance_To_Fire_Points
## "integer" "integer"
## Wilderness_Area1 Wilderness_Area2
## "integer" "integer"
## Wilderness_Area3 Wilderness_Area4
## "integer" "integer"
## Soil_Type1 Soil_Type2
## "integer" "integer"
## Soil_Type3 Soil_Type4
## "integer" "integer"
## Soil_Type5 Soil_Type6
## "integer" "integer"
## Soil_Type7 Soil_Type8
## "integer" "integer"
## Soil_Type9 Soil_Type10
## "integer" "integer"
## Soil_Type11 Soil_Type12
## "integer" "integer"
## Soil_Type13 Soil_Type14
## "integer" "integer"
## Soil_Type15 Soil_Type16
## "integer" "integer"
## Soil_Type17 Soil_Type18
## "integer" "integer"
## Soil_Type19 Soil_Type20
## "integer" "integer"
## Soil_Type21 Soil_Type22
## "integer" "integer"
## Soil_Type23 Soil_Type24
## "integer" "integer"
## Soil_Type25 Soil_Type26
## "integer" "integer"
## Soil_Type27 Soil_Type28
## "integer" "integer"
## Soil_Type29 Soil_Type30
## "integer" "integer"
## Soil_Type31 Soil_Type32
## "integer" "integer"
## Soil_Type33 Soil_Type34
## "integer" "integer"
## Soil_Type35 Soil_Type36
## "integer" "integer"
## Soil_Type37 Soil_Type38
## "integer" "integer"
## Soil_Type39 Soil_Type40
## "integer" "integer"
## Cover_Type
## "factor"
sapply(Xy_original, function(x) sum(is.na(x)))
## Elevation Aspect
## 0 0
## Slope Horizontal_Distance_To_Hydrology
## 0 0
## Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways
## 0 0
## Hillshade_9am Hillshade_Noon
## 0 0
## Hillshade_3pm Horizontal_Distance_To_Fire_Points
## 0 0
## Wilderness_Area1 Wilderness_Area2
## 0 0
## Wilderness_Area3 Wilderness_Area4
## 0 0
## Soil_Type1 Soil_Type2
## 0 0
## Soil_Type3 Soil_Type4
## 0 0
## Soil_Type5 Soil_Type6
## 0 0
## Soil_Type7 Soil_Type8
## 0 0
## Soil_Type9 Soil_Type10
## 0 0
## Soil_Type11 Soil_Type12
## 0 0
## Soil_Type13 Soil_Type14
## 0 0
## Soil_Type15 Soil_Type16
## 0 0
## Soil_Type17 Soil_Type18
## 0 0
## Soil_Type19 Soil_Type20
## 0 0
## Soil_Type21 Soil_Type22
## 0 0
## Soil_Type23 Soil_Type24
## 0 0
## Soil_Type25 Soil_Type26
## 0 0
## Soil_Type27 Soil_Type28
## 0 0
## Soil_Type29 Soil_Type30
## 0 0
## Soil_Type31 Soil_Type32
## 0 0
## Soil_Type33 Soil_Type34
## 0 0
## Soil_Type35 Soil_Type36
## 0 0
## Soil_Type37 Soil_Type38
## 0 0
## Soil_Type39 Soil_Type40
## 0 0
## Cover_Type
## 0
# Use variable totCol to hold the number of columns in the dataframe
totCol <- ncol(Xy_original)
# Set up variable totAttr for the total number of attribute columns
totAttr <- totCol-1
# targetCol variable indicates the column location of the target/class variable
# If the first column, set targetCol to 1. If the last column, set targetCol to totCol
# If targetCol is neither 1 nor totCol, be aware when slicing up the dataframes for visualization!
targetCol <- totCol
# Standardize the class column to the name of targetVar if applicable
colnames(Xy_original)[targetCol] <- "targetVar"
# We create training datasets (Xy_train, X_train, y_train) for various visualization and cleaning/transformation operations.
# We create testing datasets (Xy_test, y_test) for various visualization and cleaning/transformation operations.
set.seed(seedNum)
# Create a list of the rows in the original dataset we can use for training
# Use 70% of the data to train the models and the remaining for testing/validation
training_index <- createDataPartition(Xy_original$targetVar, p=0.70, list=FALSE)
Xy_train <- Xy_original[training_index,]
Xy_test <- Xy_original[-training_index,]
if (targetCol == 1) {
  X_train <- Xy_train[, (targetCol+1):totCol]
  y_train <- Xy_train[, targetCol]
  y_test <- Xy_test[, targetCol]
} else {
  X_train <- Xy_train[, 1:totAttr]
  y_train <- Xy_train[, totCol]
  y_test <- Xy_test[, totCol]
}
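The slicing logic above can be illustrated with a small standalone example (hypothetical columns, target in the last position, mirroring this dataset's layout; it uses its own local copies of totCol and totAttr):

```r
# Toy data frame: three attribute columns plus a factor target in the last column
toy <- data.frame(a = 1:6, b = 7:12, c = 13:18,
                  targetVar = factor(rep(1:2, 3)))
totCol <- ncol(toy)        # 4
totAttr <- totCol - 1      # 3
X <- toy[, 1:totAttr]      # attribute columns only
y <- toy[, totCol]         # target column
names(X)                   # "a" "b" "c"
```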
# Set up the number of row and columns for visualization display. dispRow * dispCol should be >= totAttr
dispCol <- 3
if (totAttr %% dispCol == 0) {
  dispRow <- totAttr %/% dispCol
} else {
  dispRow <- (totAttr %/% dispCol) + 1
}
cat("Will attempt to create graphics grid (col x row): ", dispCol, ' by ', dispRow)
## Will attempt to create graphics grid (col x row): 3 by 18
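The modulo-based branch above is equivalent to ceiling division; a quick standalone check (with its own local totAttr and dispCol):

```r
totAttr <- 54; dispCol <- 3
# The if/else logic from above, condensed into one expression
dispRow <- if (totAttr %% dispCol == 0) totAttr %/% dispCol else totAttr %/% dispCol + 1
# ceiling() gives the same result in a single step
dispRow == ceiling(totAttr / dispCol)   # TRUE
```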
# Run algorithms using 10-fold cross validation
control <- trainControl(method="repeatedcv", number=10, repeats=1)
metricTarget <- "Accuracy"
if (notifyStatus) email_notify(paste("Library and Data Loading completed!",date()))
To gain a better understanding of the data we have on hand, we will leverage a number of descriptive statistics and data visualization techniques. The plan is to use the results to consider new questions, review assumptions, and identify hypotheses that we can investigate later with specialized models.
if (notifyStatus) email_notify(paste("Data Summarization and Visualization has begun!",date()))
head(Xy_train)
## Elevation Aspect Slope Horizontal_Distance_To_Hydrology
## 1 2596 51 3 258
## 3 2804 139 9 268
## 4 2785 155 18 242
## 5 2595 45 2 153
## 6 2579 132 6 300
## 7 2606 45 7 270
## Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways
## 1 0 510
## 3 65 3180
## 4 118 3090
## 5 -1 391
## 6 -15 67
## 7 5 633
## Hillshade_9am Hillshade_Noon Hillshade_3pm
## 1 221 232 148
## 3 234 238 135
## 4 238 238 122
## 5 220 234 150
## 6 230 237 140
## 7 222 225 138
## Horizontal_Distance_To_Fire_Points Wilderness_Area1 Wilderness_Area2
## 1 6279 1 0
## 3 6121 1 0
## 4 6211 1 0
## 5 6172 1 0
## 6 6031 1 0
## 7 6256 1 0
## Wilderness_Area3 Wilderness_Area4 Soil_Type1 Soil_Type2 Soil_Type3
## 1 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## 7 0 0 0 0 0
## Soil_Type4 Soil_Type5 Soil_Type6 Soil_Type7 Soil_Type8 Soil_Type9
## 1 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## 7 0 0 0 0 0 0
## Soil_Type10 Soil_Type11 Soil_Type12 Soil_Type13 Soil_Type14 Soil_Type15
## 1 0 0 0 0 0 0
## 3 0 0 1 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## 7 0 0 0 0 0 0
## Soil_Type16 Soil_Type17 Soil_Type18 Soil_Type19 Soil_Type20 Soil_Type21
## 1 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## 7 0 0 0 0 0 0
## Soil_Type22 Soil_Type23 Soil_Type24 Soil_Type25 Soil_Type26 Soil_Type27
## 1 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## 7 0 0 0 0 0 0
## Soil_Type28 Soil_Type29 Soil_Type30 Soil_Type31 Soil_Type32 Soil_Type33
## 1 0 1 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 1 0 0 0
## 5 0 1 0 0 0 0
## 6 0 1 0 0 0 0
## 7 0 1 0 0 0 0
## Soil_Type34 Soil_Type35 Soil_Type36 Soil_Type37 Soil_Type38 Soil_Type39
## 1 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## 7 0 0 0 0 0 0
## Soil_Type40 targetVar
## 1 0 5
## 3 0 2
## 4 0 2
## 5 0 5
## 6 0 2
## 7 0 5
dim(Xy_train)
## [1] 10584 55
sapply(Xy_train, class)
## Elevation Aspect
## "integer" "integer"
## Slope Horizontal_Distance_To_Hydrology
## "integer" "integer"
## Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways
## "integer" "integer"
## Hillshade_9am Hillshade_Noon
## "integer" "integer"
## Hillshade_3pm Horizontal_Distance_To_Fire_Points
## "integer" "integer"
## Wilderness_Area1 Wilderness_Area2
## "integer" "integer"
## Wilderness_Area3 Wilderness_Area4
## "integer" "integer"
## Soil_Type1 Soil_Type2
## "integer" "integer"
## Soil_Type3 Soil_Type4
## "integer" "integer"
## Soil_Type5 Soil_Type6
## "integer" "integer"
## Soil_Type7 Soil_Type8
## "integer" "integer"
## Soil_Type9 Soil_Type10
## "integer" "integer"
## Soil_Type11 Soil_Type12
## "integer" "integer"
## Soil_Type13 Soil_Type14
## "integer" "integer"
## Soil_Type15 Soil_Type16
## "integer" "integer"
## Soil_Type17 Soil_Type18
## "integer" "integer"
## Soil_Type19 Soil_Type20
## "integer" "integer"
## Soil_Type21 Soil_Type22
## "integer" "integer"
## Soil_Type23 Soil_Type24
## "integer" "integer"
## Soil_Type25 Soil_Type26
## "integer" "integer"
## Soil_Type27 Soil_Type28
## "integer" "integer"
## Soil_Type29 Soil_Type30
## "integer" "integer"
## Soil_Type31 Soil_Type32
## "integer" "integer"
## Soil_Type33 Soil_Type34
## "integer" "integer"
## Soil_Type35 Soil_Type36
## "integer" "integer"
## Soil_Type37 Soil_Type38
## "integer" "integer"
## Soil_Type39 Soil_Type40
## "integer" "integer"
## targetVar
## "factor"
summary(Xy_train)
## Elevation Aspect Slope
## Min. :1874 Min. : 0.0 Min. : 0.00
## 1st Qu.:2377 1st Qu.: 65.0 1st Qu.:10.00
## Median :2751 Median :126.0 Median :15.00
## Mean :2751 Mean :156.8 Mean :16.48
## 3rd Qu.:3108 3rd Qu.:260.0 3rd Qu.:22.00
## Max. :3846 Max. :360.0 Max. :50.00
##
## Horizontal_Distance_To_Hydrology Vertical_Distance_To_Hydrology
## Min. : 0.0 Min. :-134.00
## 1st Qu.: 60.0 1st Qu.: 4.00
## Median : 180.0 Median : 32.00
## Mean : 226.4 Mean : 50.53
## 3rd Qu.: 323.2 3rd Qu.: 79.00
## Max. :1343.0 Max. : 554.00
##
## Horizontal_Distance_To_Roadways Hillshade_9am Hillshade_Noon
## Min. : 0 Min. : 58.0 Min. : 99.0
## 1st Qu.: 768 1st Qu.:196.0 1st Qu.:207.0
## Median :1317 Median :220.0 Median :223.0
## Mean :1714 Mean :212.7 Mean :219.1
## 3rd Qu.:2263 3rd Qu.:235.0 3rd Qu.:235.0
## Max. :6836 Max. :254.0 Max. :254.0
##
## Hillshade_3pm Horizontal_Distance_To_Fire_Points Wilderness_Area1
## Min. : 0.0 Min. : 30 Min. :0.0000
## 1st Qu.:107.0 1st Qu.: 732 1st Qu.:0.0000
## Median :138.0 Median :1256 Median :0.0000
## Mean :135.2 Mean :1515 Mean :0.2367
## 3rd Qu.:167.0 3rd Qu.:1992 3rd Qu.:0.0000
## Max. :247.0 Max. :6993 Max. :1.0000
##
## Wilderness_Area2 Wilderness_Area3 Wilderness_Area4 Soil_Type1
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.03231 Mean :0.4228 Mean :0.3082 Mean :0.0239
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## Soil_Type2 Soil_Type3 Soil_Type4 Soil_Type5
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.04176 Mean :0.06349 Mean :0.05678 Mean :0.01058
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000
##
## Soil_Type6 Soil_Type7 Soil_Type8 Soil_Type9
## Min. :0.00000 Min. :0 Min. :0.00e+00 Min. :0.0000000
## 1st Qu.:0.00000 1st Qu.:0 1st Qu.:0.00e+00 1st Qu.:0.0000000
## Median :0.00000 Median :0 Median :0.00e+00 Median :0.0000000
## Mean :0.04091 Mean :0 Mean :9.45e-05 Mean :0.0006614
## 3rd Qu.:0.00000 3rd Qu.:0 3rd Qu.:0.00e+00 3rd Qu.:0.0000000
## Max. :1.00000 Max. :0 Max. :1.00e+00 Max. :1.0000000
##
## Soil_Type10 Soil_Type11 Soil_Type12 Soil_Type13
## Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.1419 Mean :0.02731 Mean :0.01455 Mean :0.03193
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.00000
##
## Soil_Type14 Soil_Type15 Soil_Type16 Soil_Type17
## Min. :0.00000 Min. :0 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.00000 Median :0 Median :0.000000 Median :0.00000
## Mean :0.01039 Mean :0 Mean :0.008031 Mean :0.04034
## 3rd Qu.:0.00000 3rd Qu.:0 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :1.00000 Max. :0 Max. :1.000000 Max. :1.00000
##
## Soil_Type18 Soil_Type19 Soil_Type20
## Min. :0.000000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.000000 Median :0.000000 Median :0.000000
## Mean :0.003874 Mean :0.003212 Mean :0.009448
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.000000 Max. :1.000000
##
## Soil_Type21 Soil_Type22 Soil_Type23 Soil_Type24
## Min. :0.000000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.000000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.001323 Mean :0.02353 Mean :0.05102 Mean :0.01635
## 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.00000 Max. :1.00000 Max. :1.00000
##
## Soil_Type25 Soil_Type26 Soil_Type27
## Min. :0.00e+00 Min. :0.000000 Min. :0.0000000
## 1st Qu.:0.00e+00 1st Qu.:0.000000 1st Qu.:0.0000000
## Median :0.00e+00 Median :0.000000 Median :0.0000000
## Mean :9.45e-05 Mean :0.003401 Mean :0.0008503
## 3rd Qu.:0.00e+00 3rd Qu.:0.000000 3rd Qu.:0.0000000
## Max. :1.00e+00 Max. :1.000000 Max. :1.0000000
##
## Soil_Type28 Soil_Type29 Soil_Type30 Soil_Type31
## Min. :0.0000000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.0002834 Mean :0.08626 Mean :0.04639 Mean :0.02173
## 3rd Qu.:0.0000000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000000 Max. :1.00000 Max. :1.00000 Max. :1.00000
##
## Soil_Type32 Soil_Type33 Soil_Type34 Soil_Type35
## Min. :0.00000 Min. :0.00000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.00000 Median :0.00000 Median :0.000000 Median :0.000000
## Mean :0.04639 Mean :0.04101 Mean :0.001417 Mean :0.007086
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.00000 Max. :1.000000 Max. :1.000000
##
## Soil_Type36 Soil_Type37 Soil_Type38 Soil_Type39
## Min. :0.0000000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.0006614 Mean :0.00189 Mean :0.04639 Mean :0.04403
## 3rd Qu.:0.0000000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000000 Max. :1.00000 Max. :1.00000 Max. :1.00000
##
## Soil_Type40 targetVar
## Min. :0.00000 1:1512
## 1st Qu.:0.00000 2:1512
## Median :0.00000 3:1512
## Mean :0.03071 4:1512
## 3rd Qu.:0.00000 5:1512
## Max. :1.00000 6:1512
## 7:1512
sapply(Xy_train, function(x) sum(is.na(x)))
## Elevation Aspect
## 0 0
## Slope Horizontal_Distance_To_Hydrology
## 0 0
## Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways
## 0 0
## Hillshade_9am Hillshade_Noon
## 0 0
## Hillshade_3pm Horizontal_Distance_To_Fire_Points
## 0 0
## Wilderness_Area1 Wilderness_Area2
## 0 0
## Wilderness_Area3 Wilderness_Area4
## 0 0
## Soil_Type1 Soil_Type2
## 0 0
## Soil_Type3 Soil_Type4
## 0 0
## Soil_Type5 Soil_Type6
## 0 0
## Soil_Type7 Soil_Type8
## 0 0
## Soil_Type9 Soil_Type10
## 0 0
## Soil_Type11 Soil_Type12
## 0 0
## Soil_Type13 Soil_Type14
## 0 0
## Soil_Type15 Soil_Type16
## 0 0
## Soil_Type17 Soil_Type18
## 0 0
## Soil_Type19 Soil_Type20
## 0 0
## Soil_Type21 Soil_Type22
## 0 0
## Soil_Type23 Soil_Type24
## 0 0
## Soil_Type25 Soil_Type26
## 0 0
## Soil_Type27 Soil_Type28
## 0 0
## Soil_Type29 Soil_Type30
## 0 0
## Soil_Type31 Soil_Type32
## 0 0
## Soil_Type33 Soil_Type34
## 0 0
## Soil_Type35 Soil_Type36
## 0 0
## Soil_Type37 Soil_Type38
## 0 0
## Soil_Type39 Soil_Type40
## 0 0
## targetVar
## 0
cbind(freq=table(y_train), percentage=prop.table(table(y_train))*100)
## freq percentage
## 1 1512 14.28571
## 2 1512 14.28571
## 3 1512 14.28571
## 4 1512 14.28571
## 5 1512 14.28571
## 6 1512 14.28571
## 7 1512 14.28571
# Boxplots for each attribute
# par(mfrow=c(dispRow,dispCol))
for (i in 1:totAttr) {
  boxplot(X_train[,i], main=names(X_train)[i])
}
# Histograms for each attribute
# par(mfrow=c(dispRow,dispCol))
for (i in 1:totAttr) {
  hist(X_train[,i], main=names(X_train)[i])
}
# Density plot for each attribute
# par(mfrow=c(dispRow,dispCol))
for (i in 1:totAttr) {
  plot(density(X_train[,i]), main=names(X_train)[i])
}
# Scatterplot matrix colored by class
# pairs(targetVar~., data=Xy_train, col=Xy_train$targetVar)
# Box and whisker plots for each attribute by class
# scales <- list(x=list(relation="free"), y=list(relation="free"))
# featurePlot(x=X_train, y=y_train, plot="box", scales=scales)
# Density plots for each attribute by class value
# featurePlot(x=X_train, y=y_train, plot="density", scales=scales)
# Correlation plot
correlations <- cor(X_train)
## Warning in cor(X_train): the standard deviation is zero
corrplot(correlations, method="circle")
if (notifyStatus) email_notify(paste("Data Summarization and Visualization completed!",date()))
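The "standard deviation is zero" warning from cor() above comes from constant columns: in this training split, Soil_Type7 and Soil_Type15 contain only zeros (see the summary output). A small standalone sketch of dropping zero-variance columns before computing correlations, using base R on toy data:

```r
# Toy data: column "z" is constant, like Soil_Type7/Soil_Type15 above
X_demo <- data.frame(a = c(1, 2, 3, 4), b = c(4, 3, 2, 1), z = c(0, 0, 0, 0))
keep <- sapply(X_demo, sd) > 0   # TRUE only for columns with non-zero variance
correlations <- cor(X_demo[, keep])
colnames(correlations)           # "a" "b" -- no warning this time
```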
Some datasets may require additional preparation activities that best expose the structure of the problem and the relationships between the input attributes and the output variable. Such data-prep tasks might include:
if (notifyStatus) email_notify(paste("Data Cleaning and Transformation has begun!",date()))
# Not applicable for this iteration of the project.
# Not applicable for this iteration of the project.
# Perform the Recursive Feature Elimination (RFE) technique
startTimeModule <- proc.time()
set.seed(seedNum)
X_rfe <- Xy_train[,1:totAttr]
y_rfe <- Xy_train[,totCol]
normalization <- preProcess(X_rfe)
## Warning in preProcess.default(X_rfe): These variables have zero variances:
## Soil_Type7, Soil_Type15
X_rfe <- predict(normalization, X_rfe)
X_rfe <- as.data.frame(X_rfe)
rfeCTRL <- rfeControl(functions=rfFuncs, method="cv", number=10, repeats=1, verbose=FALSE, returnResamp="all")
optimalVars <- 50
subsets <- c(2:optimalVars)
rfeProfile <- rfe(X_rfe, y_rfe, sizes=subsets, rfeControl=rfeCTRL)
print(rfeProfile)
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 2 0.6002 0.5335 0.014390 0.016786
## 3 0.6917 0.6403 0.019110 0.022293
## 4 0.7901 0.7551 0.015449 0.018020
## 5 0.8093 0.7776 0.013989 0.016317
## 6 0.8206 0.7907 0.017082 0.019927
## 7 0.8350 0.8075 0.011035 0.012872
## 8 0.8298 0.8015 0.011884 0.013863
## 9 0.8280 0.7994 0.011159 0.013016
## 10 0.8300 0.8017 0.008177 0.009537
## 11 0.8402 0.8136 0.006374 0.007436
## 12 0.8419 0.8156 0.011047 0.012885 *
## 13 0.8403 0.8137 0.010572 0.012331
## 14 0.8359 0.8085 0.011841 0.013814
## 15 0.8332 0.8054 0.009684 0.011296
## 16 0.8415 0.8150 0.010639 0.012409
## 17 0.8401 0.8135 0.012045 0.014050
## 18 0.8411 0.8146 0.009546 0.011134
## 19 0.8383 0.8114 0.011660 0.013601
## 20 0.8352 0.8078 0.009707 0.011323
## 21 0.8343 0.8067 0.010848 0.012654
## 22 0.8348 0.8073 0.009347 0.010902
## 23 0.8313 0.8031 0.008934 0.010421
## 24 0.8297 0.8014 0.011165 0.013024
## 25 0.8411 0.8146 0.010265 0.011974
## 26 0.8412 0.8147 0.011118 0.012969
## 27 0.8403 0.8137 0.011142 0.012998
## 28 0.8387 0.8118 0.010594 0.012359
## 29 0.8346 0.8070 0.013208 0.015408
## 30 0.8342 0.8066 0.011277 0.013155
## 31 0.8325 0.8046 0.011607 0.013540
## 32 0.8305 0.8023 0.012344 0.014400
## 33 0.8291 0.8006 0.012798 0.014930
## 34 0.8267 0.7978 0.012917 0.015069
## 35 0.8251 0.7960 0.013152 0.015343
## 36 0.8365 0.8092 0.013346 0.015569
## 37 0.8364 0.8091 0.015186 0.017716
## 38 0.8335 0.8058 0.013125 0.015311
## 39 0.8325 0.8046 0.012022 0.014023
## 40 0.8306 0.8024 0.011440 0.013345
## 41 0.8302 0.8019 0.013796 0.016093
## 42 0.8287 0.8002 0.013907 0.016223
## 43 0.8250 0.7959 0.010943 0.012765
## 44 0.8253 0.7962 0.013644 0.015916
## 45 0.8231 0.7937 0.012249 0.014287
## 46 0.8220 0.7923 0.013252 0.015458
## 47 0.8185 0.7883 0.010991 0.012820
## 48 0.8141 0.7831 0.011523 0.013441
## 49 0.8292 0.8007 0.015338 0.017892
## 50 0.8265 0.7976 0.012120 0.014137
## 54 0.8180 0.7877 0.012557 0.014647
##
## The top 5 variables (out of 12):
## Elevation, Horizontal_Distance_To_Roadways, Horizontal_Distance_To_Hydrology, Horizontal_Distance_To_Fire_Points, Vertical_Distance_To_Hydrology
plot(rfeProfile, type=c("g", "o"))
# Perform the Recursive Feature Elimination (RFE) technique
numberRFEVars <- length(predictors(rfeProfile))
if (numberRFEVars <= optimalVars) {
rfeAttributes <- predictors(rfeProfile)
} else {
newProfile <- update(rfeProfile, x=X_rfe, y=y_rfe, size=optimalVars)
rfeAttributes <- newProfile$bestVar
}
cat('Number of attributes selected from the RFE algorithm:',length(rfeAttributes),'\n')
## Number of attributes selected from the RFE algorithm: 12
print(rfeAttributes)
## [1] "Elevation"
## [2] "Horizontal_Distance_To_Roadways"
## [3] "Horizontal_Distance_To_Hydrology"
## [4] "Horizontal_Distance_To_Fire_Points"
## [5] "Vertical_Distance_To_Hydrology"
## [6] "Hillshade_Noon"
## [7] "Hillshade_9am"
## [8] "Hillshade_3pm"
## [9] "Aspect"
## [10] "Wilderness_Area4"
## [11] "Soil_Type10"
## [12] "Wilderness_Area1"
# Removing the unselected attributes from the training and validation dataframes
rfeAttributes <- c(rfeAttributes,"targetVar")
Xy_train <- Xy_train[, (names(Xy_train) %in% rfeAttributes)]
Xy_test <- Xy_test[, (names(Xy_test) %in% rfeAttributes)]
dim(Xy_train)
## [1] 10584 13
dim(Xy_test)
## [1] 4536 13
sapply(Xy_train, class)
## Elevation Aspect
## "integer" "integer"
## Horizontal_Distance_To_Hydrology Vertical_Distance_To_Hydrology
## "integer" "integer"
## Horizontal_Distance_To_Roadways Hillshade_9am
## "integer" "integer"
## Hillshade_Noon Hillshade_3pm
## "integer" "integer"
## Horizontal_Distance_To_Fire_Points Wilderness_Area1
## "integer" "integer"
## Wilderness_Area4 Soil_Type10
## "integer" "integer"
## targetVar
## "factor"
if (notifyStatus) email_notify(paste("Data Cleaning and Transformation completed!",date()))
proc.time()-startTimeScript
## user system elapsed
## 14281.941 36.026 14345.447
After the data-prep, we next work on finding a workable model by evaluating a subset of machine learning algorithms that are good at exploiting the structure of the training data. Typical evaluation tasks include defining the test options (such as cross-validation), spot-checking a suite of algorithms, and comparing their estimated accuracy.
For this project, we will evaluate one linear, one non-linear, and three ensemble algorithms:
Linear Algorithm: Linear Discriminant Analysis
Non-Linear Algorithm: Decision Trees (CART)
Ensemble Algorithms: Bagged CART, Random Forest, and Gradient Boosting
The random number seed is reset before each run to ensure that each algorithm is evaluated using the same data splits, which makes the results directly comparable.
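The `control`, `metricTarget`, and `seedNum` objects referenced by the `train()` calls below were defined earlier in the script and are not shown in this section. A minimal sketch of what they likely look like, inferred from the printed resampling summaries ("Cross-Validated (10 fold, repeated 1 times)" and "Accuracy was used to select the optimal model"); the actual values may differ:

```r
library(caret)

# Assumed setup, reconstructed from the output shown below
seedNum <- 888                              # hypothetical seed value
control <- trainControl(method = "repeatedcv", number = 10, repeats = 1)
metricTarget <- "Accuracy"                  # metric used to pick the best tuning parameters
```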
startModeling <- proc.time()
# Linear Discriminant Analysis (Classification)
# if (notifyStatus) email_notify(paste("Linear Discriminant Analysis modeling has begun!",date()))
# startTimeModule <- proc.time()
# set.seed(seedNum)
# fit.lda <- train(targetVar~., data=Xy_train, method="lda", metric=metricTarget, trControl=control)
# print(fit.lda)
# proc.time()-startTimeModule
# if (notifyStatus) email_notify(paste("Linear Discriminant Analysis modeling completed!",date()))
# Decision Tree - CART (Regression/Classification)
if (notifyStatus) email_notify(paste("Decision Tree modeling has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
fit.cart <- train(targetVar~., data=Xy_train, method="rpart", metric=metricTarget, trControl=control)
print(fit.cart)
## CART
##
## 10584 samples
## 12 predictor
## 7 classes: '1', '2', '3', '4', '5', '6', '7'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 9525, 9527, 9526, 9524, 9524, 9526, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.09656085 0.4616631 0.37194057
## 0.13966049 0.3336852 0.22266751
## 0.16666667 0.2140159 0.08322312
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.09656085.
proc.time()-startTimeModule
## user system elapsed
## 2.478 0.856 2.387
if (notifyStatus) email_notify(paste("Decision Tree modeling completed!",date()))
In this section, we will explore the use and tuning of ensemble algorithms to see whether we can improve the results.
# Bagged CART (Regression/Classification)
if (notifyStatus) email_notify(paste("Bagged CART modeling has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
fit.bagcart <- train(targetVar~., data=Xy_train, method="treebag", metric=metricTarget, trControl=control)
print(fit.bagcart)
## Bagged CART
##
## 10584 samples
## 12 predictor
## 7 classes: '1', '2', '3', '4', '5', '6', '7'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 9525, 9527, 9526, 9524, 9524, 9526, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8234126 0.7939812
proc.time()-startTimeModule
## user system elapsed
## 43.964 24.746 39.495
if (notifyStatus) email_notify(paste("Bagged CART modeling completed!",date()))
# Random Forest (Regression/Classification)
if (notifyStatus) email_notify(paste("Random Forest modeling has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
fit.rf <- train(targetVar~., data=Xy_train, method="rf", metric=metricTarget, trControl=control)
print(fit.rf)
## Random Forest
##
## 10584 samples
## 12 predictor
## 7 classes: '1', '2', '3', '4', '5', '6', '7'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 9525, 9527, 9526, 9524, 9524, 9526, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8197284 0.7896826
## 7 0.8417432 0.8153668
## 12 0.8331438 0.8053340
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 7.
proc.time()-startTimeModule
## user system elapsed
## 307.654 2.061 310.372
if (notifyStatus) email_notify(paste("Random Forest modeling completed!",date()))
# Gradient Boosting (Regression/Classification)
if (notifyStatus) email_notify(paste("Gradient Boosting modeling has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
fit.gbm <- train(targetVar~., data=Xy_train, method="xgbTree", metric=metricTarget, trControl=control, verbose=F)
# fit.gbm <- train(targetVar~., data=Xy_train, method="gbm", metric=metricTarget, trControl=control, verbose=F)
print(fit.gbm)
## eXtreme Gradient Boosting
##
## 10584 samples
## 12 predictor
## 7 classes: '1', '2', '3', '4', '5', '6', '7'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 9525, 9527, 9526, 9524, 9524, 9526, ...
## Resampling results across tuning parameters:
##
## eta max_depth colsample_bytree subsample nrounds Accuracy
## 0.3 1 0.6 0.50 50 0.6792317
## 0.3 1 0.6 0.50 100 0.7031374
## 0.3 1 0.6 0.50 150 0.7155177
## 0.3 1 0.6 0.75 50 0.6783846
## 0.3 1 0.6 0.75 100 0.7044618
## 0.3 1 0.6 0.75 150 0.7134367
## 0.3 1 0.6 1.00 50 0.6750755
## 0.3 1 0.6 1.00 100 0.7007749
## 0.3 1 0.6 1.00 150 0.7112663
## 0.3 1 0.8 0.50 50 0.6748842
## 0.3 1 0.8 0.50 100 0.7074851
## 0.3 1 0.8 0.50 150 0.7139084
## 0.3 1 0.8 0.75 50 0.6776288
## 0.3 1 0.8 0.75 100 0.7052192
## 0.3 1 0.8 0.75 150 0.7172182
## 0.3 1 0.8 1.00 50 0.6740368
## 0.3 1 0.8 1.00 100 0.6992645
## 0.3 1 0.8 1.00 150 0.7114555
## 0.3 2 0.6 0.50 50 0.7227910
## 0.3 2 0.6 0.50 100 0.7505667
## 0.3 2 0.6 0.50 150 0.7624741
## 0.3 2 0.6 0.75 50 0.7249628
## 0.3 2 0.6 0.75 100 0.7521732
## 0.3 2 0.6 0.75 150 0.7641719
## 0.3 2 0.6 1.00 50 0.7195780
## 0.3 2 0.6 1.00 100 0.7450870
## 0.3 2 0.6 1.00 150 0.7621876
## 0.3 2 0.8 0.50 50 0.7293085
## 0.3 2 0.8 0.50 100 0.7541579
## 0.3 2 0.8 0.50 150 0.7675740
## 0.3 2 0.8 0.75 50 0.7293097
## 0.3 2 0.8 0.75 100 0.7542545
## 0.3 2 0.8 0.75 150 0.7663507
## 0.3 2 0.8 1.00 50 0.7215624
## 0.3 2 0.8 1.00 100 0.7486817
## 0.3 2 0.8 1.00 150 0.7634192
## 0.3 3 0.6 0.50 50 0.7569923
## 0.3 3 0.6 0.50 100 0.7805194
## 0.3 3 0.6 0.50 150 0.7884566
## 0.3 3 0.6 0.75 50 0.7529289
## 0.3 3 0.6 0.75 100 0.7825943
## 0.3 3 0.6 0.75 150 0.7962009
## 0.3 3 0.6 1.00 50 0.7481107
## 0.3 3 0.6 1.00 100 0.7808001
## 0.3 3 0.6 1.00 150 0.7917573
## 0.3 3 0.8 0.50 50 0.7599218
## 0.3 3 0.8 0.50 100 0.7876036
## 0.3 3 0.8 0.50 150 0.7960136
## 0.3 3 0.8 0.75 50 0.7609587
## 0.3 3 0.8 0.75 100 0.7864707
## 0.3 3 0.8 0.75 150 0.7979948
## 0.3 3 0.8 1.00 50 0.7548173
## 0.3 3 0.8 1.00 100 0.7816530
## 0.3 3 0.8 1.00 150 0.7951616
## 0.4 1 0.6 0.50 50 0.6933133
## 0.4 1 0.6 0.50 100 0.7162725
## 0.4 1 0.6 0.50 150 0.7200517
## 0.4 1 0.6 0.75 50 0.6862229
## 0.4 1 0.6 0.75 100 0.7118321
## 0.4 1 0.6 0.75 150 0.7206166
## 0.4 1 0.6 1.00 50 0.6858490
## 0.4 1 0.6 1.00 100 0.7088098
## 0.4 1 0.6 1.00 150 0.7176920
## 0.4 1 0.8 0.50 50 0.6931219
## 0.4 1 0.8 0.50 100 0.7142885
## 0.4 1 0.8 0.50 150 0.7207128
## 0.4 1 0.8 0.75 50 0.6907621
## 0.4 1 0.8 0.75 100 0.7122101
## 0.4 1 0.8 0.75 150 0.7212791
## 0.4 1 0.8 1.00 50 0.6866038
## 0.4 1 0.8 1.00 100 0.7071103
## 0.4 1 0.8 1.00 150 0.7152362
## 0.4 2 0.6 0.50 50 0.7380012
## 0.4 2 0.6 0.50 100 0.7591661
## 0.4 2 0.6 0.50 150 0.7715436
## 0.4 2 0.6 0.75 50 0.7298768
## 0.4 2 0.6 0.75 100 0.7584095
## 0.4 2 0.6 0.75 150 0.7716374
## 0.4 2 0.6 1.00 50 0.7316727
## 0.4 2 0.6 1.00 100 0.7572761
## 0.4 2 0.6 1.00 150 0.7707905
## 0.4 2 0.8 0.50 50 0.7399869
## 0.4 2 0.8 0.50 100 0.7600167
## 0.4 2 0.8 0.50 150 0.7698431
## 0.4 2 0.8 0.75 50 0.7413090
## 0.4 2 0.8 0.75 100 0.7660663
## 0.4 2 0.8 0.75 150 0.7745682
## 0.4 2 0.8 1.00 50 0.7341281
## 0.4 2 0.8 1.00 100 0.7576565
## 0.4 2 0.8 1.00 150 0.7709773
## 0.4 3 0.6 0.50 50 0.7641744
## 0.4 3 0.6 0.50 100 0.7830730
## 0.4 3 0.6 0.50 150 0.7910073
## 0.4 3 0.6 0.75 50 0.7663451
## 0.4 3 0.6 0.75 100 0.7900626
## 0.4 3 0.6 0.75 150 0.7994181
## 0.4 3 0.6 1.00 50 0.7627546
## 0.4 3 0.6 1.00 100 0.7876052
## 0.4 3 0.6 1.00 150 0.7954459
## 0.4 3 0.8 0.50 50 0.7703129
## 0.4 3 0.8 0.50 100 0.7898706
## 0.4 3 0.8 0.50 150 0.7976186
## 0.4 3 0.8 0.75 50 0.7729615
## 0.4 3 0.8 0.75 100 0.7938395
## 0.4 3 0.8 0.75 150 0.8031933
## 0.4 3 0.8 1.00 50 0.7668207
## 0.4 3 0.8 1.00 100 0.7921388
## 0.4 3 0.8 1.00 150 0.8003606
## Kappa
## 0.6257675
## 0.6536585
## 0.6681022
## 0.6247807
## 0.6552040
## 0.6656750
## 0.6209193
## 0.6509023
## 0.6631423
## 0.6206966
## 0.6587305
## 0.6662256
## 0.6238983
## 0.6560874
## 0.6700862
## 0.6197082
## 0.6491399
## 0.6633636
## 0.6765881
## 0.7089933
## 0.7228859
## 0.6791216
## 0.7108676
## 0.7248666
## 0.6728391
## 0.7026003
## 0.7225518
## 0.6841916
## 0.7131834
## 0.7288357
## 0.6841934
## 0.7132962
## 0.7274086
## 0.6751541
## 0.7067936
## 0.7239886
## 0.7164906
## 0.7439382
## 0.7531987
## 0.7117489
## 0.7463590
## 0.7622335
## 0.7061280
## 0.7442661
## 0.7570498
## 0.7199080
## 0.7522032
## 0.7620152
## 0.7211180
## 0.7508826
## 0.7643271
## 0.7139521
## 0.7452608
## 0.7610208
## 0.6421964
## 0.6689826
## 0.6733922
## 0.6339247
## 0.6638023
## 0.6740512
## 0.6334885
## 0.6602767
## 0.6706393
## 0.6419732
## 0.6666681
## 0.6741626
## 0.6392202
## 0.6642433
## 0.6748237
## 0.6343689
## 0.6582934
## 0.6677740
## 0.6943337
## 0.7190265
## 0.7334676
## 0.6848547
## 0.7181440
## 0.7335766
## 0.6869499
## 0.7168210
## 0.7325876
## 0.6966499
## 0.7200188
## 0.7314833
## 0.6981932
## 0.7270770
## 0.7369957
## 0.6898147
## 0.7172652
## 0.7328064
## 0.7248696
## 0.7469179
## 0.7561747
## 0.7274021
## 0.7550727
## 0.7659877
## 0.7232124
## 0.7522058
## 0.7613534
## 0.7320314
## 0.7548490
## 0.7638886
## 0.7351212
## 0.7594791
## 0.7703924
## 0.7279559
## 0.7574943
## 0.7670868
##
## Tuning parameter 'gamma' was held constant at a value of 0
##
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 150, max_depth = 3,
## eta = 0.4, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1
## and subsample = 0.75.
proc.time()-startTimeModule
## user system elapsed
## 1660.538 22.384 852.271
if (notifyStatus) email_notify(paste("Gradient Boosting modeling completed!",date()))
results <- resamples(list(CART=fit.cart, BDT=fit.bagcart, RF=fit.rf, GBM=fit.gbm))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: CART, BDT, RF, GBM
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## CART 0.4018868 0.4288752 0.4806238 0.4616631 0.4890160 0.4962193 0
## BDT 0.8119093 0.8180319 0.8219193 0.8234126 0.8270321 0.8421550 0
## RF 0.8300283 0.8394418 0.8408870 0.8417432 0.8455380 0.8525520 0
## GBM 0.7892250 0.7996692 0.8007561 0.8031933 0.8072835 0.8204159 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## CART 0.3020993 0.3337927 0.3938647 0.3719406 0.4038891 0.4121922 0
## BDT 0.7805586 0.7877062 0.7922423 0.7939812 0.7982042 0.8158487 0
## RF 0.8017071 0.8126812 0.8143637 0.8153668 0.8197955 0.8279791 0
## GBM 0.7540940 0.7662846 0.7675493 0.7703924 0.7751639 0.7904835 0
dotplot(results)
cat('The average accuracy from all models is:',
mean(c(results$values$`CART~Accuracy`,results$values$`BDT~Accuracy`,results$values$`RF~Accuracy`,results$values$`GBM~Accuracy`)),'\n')
## The average accuracy from all models is: 0.7325031
cat('Total training time for all models:',proc.time()-startModeling)
## Total training time for all models: 2015.145 50.048 1205.037 0 0
After we arrive at a short list of machine learning algorithms with a good level of accuracy, we can look for ways to improve the models further.
Using the best-performing algorithms from the previous section, we will search for a combination of parameters for each algorithm that yields the best results.
Finally, we will tune the best-performing algorithm from each group further and see whether we can get more accuracy out of it.
# Tuning algorithm #1 - Bagged CART
# No tuning parameters available for "treebag" in the caret package
if (notifyStatus) email_notify(paste("Algorithm #1 tuning has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
fit.final1 <- fit.bagcart
print(fit.final1)
## Bagged CART
##
## 10584 samples
## 12 predictor
## 7 classes: '1', '2', '3', '4', '5', '6', '7'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 9525, 9527, 9526, 9524, 9524, 9526, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8234126 0.7939812
proc.time()-startTimeModule
## user system elapsed
## 0.006 0.001 0.008
if (notifyStatus) email_notify(paste("Algorithm #1 tuning completed!",date()))
# Tuning algorithm #2 - Random Forest
if (notifyStatus) email_notify(paste("Algorithm #2 tuning has begun!",date()))
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(mtry = c(2,5,7,10,12))
fit.final2 <- train(targetVar~., data=Xy_train, method="rf", metric=metricTarget, tuneGrid=grid, trControl=control)
plot(fit.final2)
print(fit.final2)
## Random Forest
##
## 10584 samples
## 12 predictor
## 7 classes: '1', '2', '3', '4', '5', '6', '7'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 9525, 9527, 9526, 9524, 9524, 9526, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8178379 0.7874769
## 5 0.8424034 0.8161366
## 7 0.8394758 0.8127214
## 10 0.8361681 0.8088626
## 12 0.8321061 0.8041235
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 5.
proc.time()-startTimeModule
## user system elapsed
## 557.843 3.669 562.572
if (notifyStatus) email_notify(paste("Algorithm #2 tuning completed!",date()))
results <- resamples(list(BDT=fit.final1, RF=fit.final2))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: BDT, RF
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## BDT 0.8119093 0.8180319 0.8219193 0.8234126 0.8270321 0.8421550 0
## RF 0.8344371 0.8379017 0.8413605 0.8424034 0.8445010 0.8563327 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## BDT 0.7805586 0.7877062 0.7922423 0.7939812 0.7982042 0.8158487 0
## RF 0.8068433 0.8108822 0.8149208 0.8161366 0.8185794 0.8323887 0
dotplot(results)
Once we have narrowed down to a model that we believe can make accurate predictions on unseen data, we are ready to finalize it. Finalizing a model may involve sub-tasks such as making predictions on the validation dataset, creating a standalone model on the entire dataset, and saving the model for later use.
if (notifyStatus) email_notify(paste("Model Validation and Final Model Creation has begun!",date()))
predictions <- predict(fit.final1, newdata=Xy_test)
confusionMatrix(predictions, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5 6 7
## 1 452 126 0 0 0 0 33
## 2 116 421 7 0 15 1 2
## 3 2 20 516 19 9 86 0
## 4 0 0 34 622 0 24 0
## 5 15 57 10 0 612 7 1
## 6 3 18 81 7 12 530 0
## 7 60 6 0 0 0 0 612
##
## Overall Statistics
##
## Accuracy : 0.83
## 95% CI : (0.8188, 0.8409)
## No Information Rate : 0.1429
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8017
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity 0.69753 0.64969 0.7963 0.9599 0.9444 0.8179
## Specificity 0.95910 0.96373 0.9650 0.9851 0.9769 0.9689
## Pos Pred Value 0.73977 0.74911 0.7914 0.9147 0.8718 0.8141
## Neg Pred Value 0.95006 0.94288 0.9660 0.9933 0.9906 0.9696
## Prevalence 0.14286 0.14286 0.1429 0.1429 0.1429 0.1429
## Detection Rate 0.09965 0.09281 0.1138 0.1371 0.1349 0.1168
## Detection Prevalence 0.13470 0.12390 0.1437 0.1499 0.1548 0.1435
## Balanced Accuracy 0.82832 0.80671 0.8807 0.9725 0.9606 0.8934
## Class: 7
## Sensitivity 0.9444
## Specificity 0.9830
## Pos Pred Value 0.9027
## Neg Pred Value 0.9907
## Prevalence 0.1429
## Detection Rate 0.1349
## Detection Prevalence 0.1495
## Balanced Accuracy 0.9637
predictions <- predict(fit.final2, newdata=Xy_test)
confusionMatrix(predictions, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5 6 7
## 1 469 103 0 0 0 0 25
## 2 108 449 9 0 10 5 0
## 3 0 18 522 18 7 70 0
## 4 0 0 35 624 0 26 0
## 5 17 53 6 0 619 7 1
## 6 2 19 76 6 12 540 0
## 7 52 6 0 0 0 0 622
##
## Overall Statistics
##
## Accuracy : 0.8477
## 95% CI : (0.8369, 0.858)
## No Information Rate : 0.1429
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8223
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity 0.7238 0.69290 0.8056 0.9630 0.9552 0.8333
## Specificity 0.9671 0.96605 0.9709 0.9843 0.9784 0.9704
## Pos Pred Value 0.7856 0.77281 0.8220 0.9109 0.8805 0.8244
## Neg Pred Value 0.9546 0.94968 0.9677 0.9938 0.9924 0.9722
## Prevalence 0.1429 0.14286 0.1429 0.1429 0.1429 0.1429
## Detection Rate 0.1034 0.09899 0.1151 0.1376 0.1365 0.1190
## Detection Prevalence 0.1316 0.12809 0.1400 0.1510 0.1550 0.1444
## Balanced Accuracy 0.8454 0.82948 0.8882 0.9736 0.9668 0.9019
## Class: 7
## Sensitivity 0.9599
## Specificity 0.9851
## Pos Pred Value 0.9147
## Neg Pred Value 0.9933
## Prevalence 0.1429
## Detection Rate 0.1371
## Detection Prevalence 0.1499
## Balanced Accuracy 0.9725
startTimeModule <- proc.time()
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(seedNum)
# Combining the training and test datasets to form the original dataset that will be used for training the final model
xy_complete <- rbind(Xy_train, Xy_test)
# finalModel <- randomForest(targetVar~., xy_complete, mtry=5, na.action=na.omit)
# summary(finalModel)
proc.time()-startTimeModule
## user system elapsed
## 0.019 0.000 0.020
#saveRDS(finalModel, "./finalModel_MultiClass.rds")
if (notifyStatus) email_notify(paste("Model Validation and Final Model Creation Completed!",date()))
proc.time()-startTimeScript
## user system elapsed
## 16857.372 89.837 16115.500
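Once the finalized model has been saved with `saveRDS()`, it can be reloaded in a separate scoring session and applied to new observations. A minimal sketch, assuming the file name from the commented-out `saveRDS()` call above and a hypothetical data frame `newObservations` that contains the same 12 selected attributes as the training data:

```r
# Reload the finalized model from disk (file name assumed from the script above)
finalModel <- readRDS("./finalModel_MultiClass.rds")

# Score new observations; `newObservations` is a hypothetical data frame with
# the same 12 predictor columns used to train the model
predictions <- predict(finalModel, newdata = newObservations)
table(predictions)
```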